Context-Aware Wrapping: Synchronized Data Extraction
نویسندگان
چکیده
The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of “context awareness”: Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17-83% to 78-94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09-60% to 01-11%).
منابع مشابه
Context-aware Modeling for Spatio-temporal Data Transmitted from a Wireless Body Sensor Network
Context-aware systems must be interoperable and work across different platforms at any time and in any place. Context data collected from wireless body area networks (WBAN) may be heterogeneous and imperfect, which makes their design and implementation difficult. In this research, we introduce a model which takes the dynamic nature of a context-aware system into consideration. This model is con...
متن کاملIntegration Issues of an Ontology Based Context Modelling Approach
In this paper we analyse the applicability of our ontology based context modelling approach, considering a range of use cases. After wrapping up the model and the Context Ontology Language (CoOL) derived from it, we introduce some interesting applications of the language, based on a scenario showing the challenges in context aware service interactions. We focus on two submodels of our model for...
متن کاملApplications of a Context Ontology Language
In this paper we analyse the applicability of our Context Ontology Language (CoOL), considering a range of use cases. After wrapping up the model in use within this language, we introduce some interesting applications of the language, based on a scenario showing the challenges in context aware service interactions. We focus on two submodels of our model for context aware service interactions, n...
متن کاملContext-aware systems: concept, functions and applications in digital libraries
Background and Aim Among the places that context-aware systems and services would be very useful, are libraries. The purpose of this study is to achieve a coherent definition of context aware systems and applications, especially in digital libraries. Method: This was a review article that was conducted by using Library method by searching articles and e-books on websites and databases. Results:...
متن کاملUnstructured Data Integration through Automata-Driven Information Extraction
Extracting information from plain text and restructuring them into relational databases raise a challenge as how to locate relevant information and update database records accordingly. In this paper, we propose a wrapper to efficiently extract information from unstructured documents, containing plain text expressed with natural-like language. Our extraction approach is based on the automata for...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2007